Abstract

We present a support vector machines (SVM) rationale suitable for regression and
quaternary classification problems that use complex data, exploiting the notions of widely
linear estimation and pure complex kernels. The recently developed Wirtinger's calculus
on complex RKHS is employed in order to compute the Lagrangian and derive the dual
optimization problem. We prove that this approach is equivalent to solving two real
SVM tasks exploiting a specific real kernel, which is induced by the chosen complex kernel.

Keywords: Support Vector Machines, Kernel methods, Widely linear estimation, complex data
1. Introduction

The support vector machines (SVM) framework has become a popular toolbox for addressing
non-linear classification and regression tasks. The excellent performance of SVMs is
firmly grounded in the context of statistical learning theory (or VC theory as it is also
called, giving credit to Vapnik and Chervonenkis, who developed it), which ensures their
fine generalization properties. Today, support vector classifiers are amongst the most
efficient algorithms for treating a large number of real world applications. In the context of
regression, this toolbox is usually known as Support Vector Regression (SVR).

In its original form, the SVM method is a nonlinear generalization of the Generalized
Portrait algorithm, which was developed in the former USSR in the sixties. The
introduction of non-linearity was carried out via a computationally elegant way known
today to the machine learning community as the kernel trick (Schölkopf and Smola, 2002).
Usually, this trick is applied in a black-box rationale:

"Given an algorithm which is formulated in terms of dot products, one can
construct an alternative algorithm by replacing each one of the dot products
with a positive definite kernel $\kappa$."

The successful application of the kernel trick in SVMs has sparked a new breed of techniques
for addressing non-linear tasks, the so called kernel-based methods. Currently, kernel-based
In kernel-based methods, the notion of the Reproducing Kernel Hilbert Space (RKHS)
plays a significant role. The original data are transformed into a higher dimensional RKHS
$\mathcal{H}$ (possibly of infinite dimension) and linear tools are applied to the transformed data
in the so called feature space $\mathcal{H}$. This is equivalent to solving a non-linear problem in
the original space. Furthermore, inner products in $\mathcal{H}$ can efficiently be computed via the
specific kernel function $\kappa$ associated to the RKHS $\mathcal{H}$, disregarding the actual structure of
the space. Recently, this rationale has been generalized, so that the task simultaneously
learns the so called kernel in some fashion, instead of selecting it a priori, in the context
of Multiple Kernel Learning (MKL) (Bach et al., 2004; Bach, 2008; Gönen and Alpaydin, 2011;
Sonnenburg et al., 2006).



Although the theory of RKHS has been developed by the mathematicians for general
complex spaces, most kernel-based methods employ real kernels. This is largely due to the
fact that many of them originated as variants of the original SVM formulation, which was
targeted to treat real data. However, in modern applications complex data arise frequently
in areas as diverse as communications, biomedicine, radar, etc. The complex domain
provides a convenient and elegant representation for such data, but also a natural way to
preserve their characteristics and to handle transformations that need to be performed.
Hence, the design of SVMs suitable for treating problems of complex and/or multidimensional
outputs has attracted some attention in the machine learning community. Perhaps
the most complete works, which attempt to generalize the SVM rationale in this fashion,
are a) the Clifford SVMs (Bayro-Corrochano and Arana-Daniel, 2010) and b) the division
algebraic SVR (Shilton and Lai, 2007; Shilton et al., 2010, 2012). In the Clifford SVMs, the
authors use Clifford algebras to extend the SVM framework to multidimensional outputs.
Clifford algebras belong to a type of associative algebras, which are used in mathematics
to generalize the complex numbers, quaternions and several other hypercomplex number
systems. On the other hand, in the division algebraic SVR, division algebras are employed
for the same purpose. These are algebras, closely related to the Clifford ones, where all
non-zero elements have multiplicative inverses. In a nutshell, Clifford algebras are more general
and they can be employed to create a general algebraic framework (i.e., addition and
multiplication operations) in any type of vector spaces (e.g., $\mathbb{R}$, $\mathbb{R}^2$, $\mathbb{R}^3$, ...), while the division
algebras are only four: the real numbers, the complex numbers ($\mathbb{R}^2$), the quaternions ($\mathbb{R}^4$)
and the octonions ($\mathbb{R}^8$). This is due to the fact that the need for inverses can only be satisfied
in these four vector spaces. Although Clifford algebras are more general, their limitations
(e.g., the lack of inverses) make them a difficult tool to work with, compared to the division
algebras. Another notable attempt that pursues similar goals is the multiregression SVMs of
Sánchez-Fernández et al. (2004), where the outputs are represented simply as vectors and
an $\epsilon$-insensitive loss function is adopted. Unfortunately, this approach does not result in a
well defined dual problem. In contrast to the more general case of hyper-complex outputs,
where applications are limited (Che Ujang et al., 2011), complex valued SVMs have been
adopted by a number of authors for the beamforming problem (e.g., Ramon et al. (2005);
Gaudes et al. (2007)), although restricted to the simple linear case.



It is important to emphasize that all the aforementioned efforts to apply the SVM
rationale to complex and hypercomplex numbers are limited to the case of the output data.
These methods consider a multidimensional output, which can be represented, for example,
as a complex number or a quaternion, while the input data are real vectors. Moreover,
they employ real valued kernels to model the input-output relationship, breaking it down
to its multidimensional components. However, in this way many of the rich geometric
characteristics of complex and hypercomplex spaces are lost. In this paper we propose a
different approach to the problem of generalizing the SVM framework to complex spaces.
Our modeling takes place directly in complex RKHS, which are generated by pure complex
kernels, instead of real ones. In that fashion, the geometry of the complex space is preserved.
To be in line with the current trend in complex signal processing, we employ the widely linear
estimation process, which has been shown to perform better than the standard linear one
(Bouboulis et al., 2012b; Adali and Li, 2010; Novey and Adali, 2008; Mandic and Goh, 2009;
Kuh and Mandic, 2009). This means that we model the input-output relationship
as a sum of two parts. The first is linear with respect to the input vector, while the
second is linear with respect to its conjugate. Moreover, we show that in the case of
complex SVMs, the widely linear approach is a necessity, as the alternative would lead
to a significantly restricted model. In order to compute the gradients, which are required
by the Karush-Kuhn-Tucker conditions and the dual, we employ the generalized Wirtinger
Calculus introduced in Bouboulis and Theodoridis (2011).



As one of our major results, we prove that working in a complex RKHS $\mathbb{H}$, with a pure
complex kernel $\kappa_{\mathbb{C}}$, is equivalent to solving two problems in a real RKHS $\mathcal{H}$, albeit with a
specific real kernel $\kappa_{\mathbb{R}}$, which is induced by the complex $\kappa_{\mathbb{C}}$. It must be pointed out that
these induced kernels are not trivial. For example, the exploitation of the complex Gaussian
kernel results in an induced kernel different from the standard real Gaussian RBF. Our
emphasis in this paper is to outline the theoretical development and to verify the validity of
our results via some simulation examples. The paper is organized as follows. In Section 2
the main mathematical background regarding RKHS is outlined and the differences between
real and complex RKHS's are highlighted. Section 3 describes the standard real SVM and
SVR algorithms. The main contributions of the paper can be found in Sections 4 and
5, where the theory and the generalized complex algorithms are developed. The complex
SVR developed there is suitable for general complex valued function estimation problems
defined on complex domains. The proposed complex SVM rationale, on the other hand, is
suitable for quaternary classification (i.e., four-class problems), in contrast to the binary
classification carried out by the real SVM approach. Experiments are presented in Section
6. Finally, Section 7 contains some concluding remarks.
2. Real and Complex RKHS

Throughout the paper, we will denote the set of all integers, real and complex numbers
by $\mathbb{N}$, $\mathbb{R}$ and $\mathbb{C}$ respectively. The imaginary unit is denoted as $i$. Vector or matrix valued
quantities appear in boldfaced symbols. A RKHS (Aronszajn, 1950) is a Hilbert space $\mathcal{H}$
over a field $\mathbb{F}$ for which there exists a positive definite function $\kappa : X \times X \to \mathbb{F}$ with the
following two important properties: a) For every $x \in X$, $\kappa(\cdot, x)$ belongs to $\mathcal{H}$ and b) $\kappa$ has
the so called reproducing property, i.e., $f(x) = \langle f, \kappa(\cdot, x)\rangle_{\mathcal{H}}$, for all $f \in \mathcal{H}$, in particular
$\kappa(x, y) = \langle \kappa(\cdot, y), \kappa(\cdot, x)\rangle_{\mathcal{H}}$. The map $\Phi : X \to \mathcal{H} : \Phi(x) = \kappa(\cdot, x)$ is called the feature map
of $\mathcal{H}$. Recall that in the case of complex Hilbert spaces (i.e., $\mathbb{F} = \mathbb{C}$) the inner product is
sesqui-linear (i.e., linear in one argument and antilinear in the other) and Hermitian. In the
real case, the symmetry condition implies $\kappa(x, y) = \langle \kappa(\cdot, y), \kappa(\cdot, x)\rangle_{\mathcal{H}} = \langle \kappa(\cdot, x), \kappa(\cdot, y)\rangle_{\mathcal{H}}$.
However, since in the complex case the inner product is Hermitian, the aforementioned
condition is equivalent to $\kappa(x, y) = \left(\langle \kappa(\cdot, x), \kappa(\cdot, y)\rangle_{\mathcal{H}}\right)^* = \kappa^*(y, x)$. In the following, we will
denote by $\mathbb{H}$ a complex RKHS and by $\mathcal{H}$ a real one. Moreover, in order to distinguish the
two cases, we will use the notations $\kappa_{\mathbb{R}}$ and $\Phi_{\mathbb{R}}$ to refer to a real kernel and its corresponding
feature map, instead of the notation $\kappa_{\mathbb{C}}$, $\Phi_{\mathbb{C}}$, which is reserved for pure complex kernels.

Undoubtedly, the most popular real kernel in the literature is the Gaussian radial basis
function, i.e., $\kappa_{\mathbb{R}^\nu, t}(x, y) := \exp\left(-t \sum_{k=1}^{\nu} (x_k - y_k)^2\right)$, defined for $x, y \in \mathbb{R}^\nu$, where
$t$ is a free positive parameter. Many more can be found in Schölkopf and Smola (2002);
Theodoridis and Koutroumbas (2008); Shawe-Taylor and Cristianini (2004); Bouboulis and
Mavroforakis (2011). Correspondingly, an important complex kernel is the complex Gaussian
kernel, which is defined as: $\kappa_{\mathbb{C}^\nu, t}(z, w) := \exp\left(-t \sum_{k=1}^{\nu} (z_k - w_k^*)^2\right)$, where $z, w \in \mathbb{C}^\nu$, $z_k$ denotes
the $k$-th component of the complex vector $z \in \mathbb{C}^\nu$ and $\exp(\cdot)$ is the extended exponential
function in the complex domain. Other examples include the Bergman and the Szegő
kernels (Paulsen, 2009).
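As a small illustration of the two Gaussian kernels just defined, the following Python sketch evaluates both on short vectors. The function names `real_gaussian`, `complex_gaussian` and `stack`, the sample vectors and the value t = 1 are illustrative assumptions and do not come from the paper. The sketch checks the Hermitian property numerically and compares the real part of the complex Gaussian kernel with the real Gaussian RBF evaluated on the stacked real and imaginary parts.

```python
import cmath
import math


def real_gaussian(x, y, t=1.0):
    """Real Gaussian RBF: exp(-t * sum_k (x_k - y_k)^2) for real vectors."""
    return math.exp(-t * sum((xk - yk) ** 2 for xk, yk in zip(x, y)))


def complex_gaussian(z, w, t=1.0):
    """Complex Gaussian kernel: exp(-t * sum_k (z_k - conj(w_k))^2)."""
    s = sum((zk - wk.conjugate()) ** 2 for zk, wk in zip(z, w))
    return cmath.exp(-t * s)


def stack(v):
    """Map a complex vector x + iy to the real vector (x, y)."""
    return [c.real for c in v] + [c.imag for c in v]


# Hypothetical sample vectors, chosen only for illustration.
z = [1 + 1j, 0.5 - 2j]
w = [0.5 - 0.5j, -1 + 0.25j]

k_zw = complex_gaussian(z, w)
k_wz = complex_gaussian(w, z)

# Hermitian property: kappa(z, w) is the conjugate of kappa(w, z).
hermitian_gap = abs(k_zw - k_wz.conjugate())

# Real Gaussian RBF evaluated on the stacked real and imaginary parts.
k_rbf = real_gaussian(stack(z), stack(w))

print("complex Gaussian kernel:", k_zw)
print("real part of the above :", k_zw.real)
print("real Gaussian RBF      :", k_rbf)
```

Writing $z - w^* = (x - x') + i(y + y')$ shows why the two values differ: the real part of the complex Gaussian kernel is $\exp\left(-t\sum_k[(x_k - x'_k)^2 - (y_k + y'_k)^2]\right)\cos\left(2t\sum_k (x_k - x'_k)(y_k + y'_k)\right)$, which is not a function of the Euclidean distance between the stacked vectors.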

Besides the complex RKHS produced by the associated complex kernels, such as the
aforementioned ones, one may construct a complex RKHS as a cartesian product of a real
RKHS with itself, in a fashion similar to the identification of the field of complex numbers,
$\mathbb{C}$, with $\mathbb{R}^2$. This technique is called complexification of a real RKHS and the respective Hilbert
space is called complexified RKHS. Let $X \subseteq \mathbb{R}^\nu$ and define the spaces $X^2 = X \times X \subseteq \mathbb{R}^{2\nu}$
and $\mathbb{X} = \{x + iy;\ x, y \in X\} \subseteq \mathbb{C}^\nu$, where the latter is equipped with a complex product
structure. Let $\mathcal{H}$ be a real RKHS associated with a real kernel $\kappa_{\mathbb{R}}$ defined on $X^2 \times X^2$
and let $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ be its corresponding inner product. Then, every $f \in \mathcal{H}$ can be regarded as a
function defined on either $X^2$ or $\mathbb{X}$, i.e., $f(z) = f(x + iy) = f(x, y)$. Moreover, we define
the cartesian product of $\mathcal{H}$ with itself, i.e., $\mathcal{H}^2 = \mathcal{H} \times \mathcal{H}$. It is easy to verify that $\mathcal{H}^2$ is also
a Hilbert Space with inner product

\[ \langle f, g \rangle_{\mathcal{H}^2} = \langle f^r, g^r \rangle_{\mathcal{H}} + \langle f^i, g^i \rangle_{\mathcal{H}}, \tag{1} \]

for $f = (f^r, f^i)$, $g = (g^r, g^i)$. Our objective is to enrich $\mathcal{H}^2$ with a complex structure. To
this end, we define the space $\mathbb{H} = \{f = f^r + i f^i;\ f^r, f^i \in \mathcal{H}\}$ equipped with the complex
inner product:

\[ \langle f, g \rangle_{\mathbb{H}} = \langle f^r, g^r \rangle_{\mathcal{H}} + \langle f^i, g^i \rangle_{\mathcal{H}} + i\left(\langle f^i, g^r \rangle_{\mathcal{H}} - \langle f^r, g^i \rangle_{\mathcal{H}}\right), \tag{2} \]

for $f = f^r + i f^i$, $g = g^r + i g^i$. It is not difficult to verify that the complexified space $\mathbb{H}$ is
a complex RKHS with kernel $\kappa$ (Paulsen, 2009). We call $\mathbb{H}$ the complexification of $\mathcal{H}$. It
can readily be seen that, although $\mathbb{H}$ is a complex RKHS, its respective kernel is real (i.e.,
its imaginary part is equal to zero). To complete the presentation of the complexification
procedure, we need a technique to implicitly map the data samples from the complex input
space to the complexified RKHS $\mathbb{H}$. This can be done using the simple rule:

\[ \Phi_{\mathbb{C}}(z) = \Phi_{\mathbb{C}}(x + iy) = \Phi_{\mathbb{C}}(x, y) = \Phi_{\mathbb{R}}(x, y) + i\,\Phi_{\mathbb{R}}(x, y), \tag{3} \]

where $\Phi_{\mathbb{R}}$ is the feature map of the real reproducing kernel $\kappa_{\mathbb{R}}$, i.e., $\Phi_{\mathbb{R}}(x, y) = \kappa_{\mathbb{R}}(\cdot, (x, y))$,
and $z = x + iy$. As a consequence, observe that:

\[ \langle \Phi_{\mathbb{C}}(z), \Phi_{\mathbb{C}}(z') \rangle_{\mathbb{H}} = 2 \langle \Phi_{\mathbb{R}}(x, y), \Phi_{\mathbb{R}}(x', y') \rangle_{\mathcal{H}} = 2 \kappa_{\mathbb{R}}\left((x', y'), (x, y)\right). \]



We have to emphasize that a complex RKHS $\mathbb{H}$ (whether it is constructed through the
complexification procedure, or it is produced by a complex kernel) can always be represented
as a cartesian product of a Hilbert space with itself, i.e., we can always identify $\mathbb{H}$ with a
doubled real space $\mathcal{H}^2$. Furthermore, the complex inner product of $\mathbb{H}$ can always be related
to the real inner product of $\mathcal{H}$ as in (2).

In order to compute the gradients of real valued cost functions, which are defined on
complex domains, we adopt the rationale of Wirtinger's calculus (Wirtinger, 1927). This
was brought into light recently (Adali and Li, 2010; Novey and Adali, 2008; Li, 2008), as
a means to compute, in an efficient and elegant way, gradients of real valued cost functions
that are defined on complex domains ($\mathbb{C}^\nu$), in the context of widely linear processing
(Mandic and Goh, 2009; Picinbono and Chevalier, 1995). It is based on simple rules and
principles, which bear a great resemblance to the rules of the standard complex derivative,
and it greatly simplifies the calculations of the respective derivatives. The difficulty
with real valued cost functions is that they do not obey the Cauchy-Riemann conditions
and are not differentiable in the complex domain. The alternative to Wirtinger's calculus
would be to consider the complex variables as pairs of two real ones and employ the common
real partial derivatives. However, this approach is usually more time consuming and
leads to more cumbersome expressions. In Bouboulis and Theodoridis (2011), the notion
of Wirtinger's calculus was extended to general complex Hilbert spaces, providing the tool
to compute the gradients that are needed to develop kernel-based algorithms for treating
complex data. In Bouboulis et al. (2012a) the notion of Wirtinger calculus was extended
to include subgradients in RKHS.
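For readers who have not met Wirtinger's calculus before, the following Python sketch estimates the two Wirtinger derivatives of a real valued cost on the complex plane. It assumes the standard scalar definitions $\partial/\partial z = \frac{1}{2}(\partial/\partial x - i\,\partial/\partial y)$ and $\partial/\partial z^* = \frac{1}{2}(\partial/\partial x + i\,\partial/\partial y)$ for $z = x + iy$, which the text above does not spell out. The helper name `wirtinger`, the cost $|z|^2$ and the step size are illustrative choices only.

```python
def wirtinger(f, z, h=1e-6):
    """Estimate (df/dz, df/dz*) of f at z with central differences.

    Uses d/dz = (d/dx - i d/dy) / 2 and d/dz* = (d/dx + i d/dy) / 2.
    """
    df_dx = (f(z + h) - f(z - h)) / (2 * h)
    df_dy = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)
    return 0.5 * (df_dx - 1j * df_dy), 0.5 * (df_dx + 1j * df_dy)


def cost(z):
    """A real valued cost on the complex plane: |z|^2 = z * conj(z)."""
    return abs(z) ** 2


z0 = 1.0 + 2.0j
d_dz, d_dz_conj = wirtinger(cost, z0)

# Treating z and z* as independent variables in z * z* gives
# df/dz = conj(z) and df/dz* = z, which the estimates should match.
print("df/dz  ~", d_dz)
print("df/dz* ~", d_dz_conj)
```

The cost $|z|^2$ is not complex differentiable, yet both Wirtinger derivatives exist and follow from the familiar product rule, which is the convenience the paragraph above refers to.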



3. Real valued Support Vector Machines 

In this section we briefly describe the popular SVM rationale for real-valued data.



3.1 Support Vector Machines for Classification 

Suppose we are given training data, which belong to two separate classes $C_+$, $C_-$ and have
the form $\{(x_n, d_n);\ n = 1, \ldots, N\} \subset X \times \{\pm 1\}$. If $d_n = +1$, then the $n$-th sample belongs
to $C_+$, while if $d_n = -1$, then the $n$-th sample belongs to $C_-$. Consider the real RKHS
$\mathcal{H}$ with respective kernel $\kappa_{\mathbb{R}}$. We transform the input data from $X$ to $\mathcal{H}$, via the feature
map $\Phi_{\mathbb{R}}$. The goal of the SVM task is to estimate the maximum margin hyperplane
that separates the points of the two classes as best as possible. As any hyperplane of $\mathcal{H}$
has the form $\langle f, w \rangle_{\mathcal{H}} + c = 0$, $f \in \mathcal{H}$, for some parameters $w \in \mathcal{H}$, $c \in \mathbb{R}$, the SVM
task can be cast as (Schölkopf and Smola, 2002; Shawe-Taylor and Cristianini, 2004;
Theodoridis and Koutroumbas, 2008):

\[
\begin{aligned}
\underset{w,\, c}{\text{minimize}} \quad & \frac{1}{2}\|w\|_{\mathcal{H}}^2 + \frac{C}{N} \sum_{n=1}^{N} \xi_n \\
\text{subject to} \quad & d_n\left(\langle \Phi_{\mathbb{R}}(x_n), w \rangle_{\mathcal{H}} + c\right) \geq 1 - \xi_n, \\
& \xi_n \geq 0, \quad \text{for } n = 1, \ldots, N,
\end{aligned}
\tag{4}
\]

for some $C > 0$. This is a constant that determines the trade-off between the two
conflicting goals of the SVM task: maximizing the margin (i.e., $2/\|w\|_{\mathcal{H}}$) and minimizing
the training error (i.e., $\sum_{n=1}^{N} \xi_n$). The optimization task (4) is often called the
C-SVM classifier, to distinguish this case from other SVM formulations, such as the $\nu$-SVM
(Theodoridis and Koutroumbas, 2008).



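Introducing the Lagrangian and exploiting the KKT conditions we find that the dual
problem is cast as:

\[
\begin{aligned}
\underset{a}{\text{maximize}} \quad & \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n,m=1}^{N} a_n a_m d_n d_m \kappa_{\mathbb{R}}(x_m, x_n) \\
\text{subject to} \quad & \sum_{n=1}^{N} a_n d_n = 0 \ \text{ and } \ a_n \in [0, C/N].
\end{aligned}
\tag{5}
\]

Furthermore, the solution can be shown to have an expansion:

\[ w = \sum_{n=1}^{N} a_n d_n \Phi_{\mathbb{R}}(x_n), \]

while the threshold $c$ can be computed by averaging

\[ c = d_m - \sum_{n=1}^{N} a_n d_n \kappa_{\mathbb{R}}(x_n, x_m) \]

over all points with $0 < a_m < C/N$, for $m = 1, \ldots, N$.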
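The expansion of $w$ and the averaging rule for $c$ can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a solver: the two training points, the linear kernel and the multipliers $a_n = 0.5$ are hand-picked so that the constraints of (5) hold, the bound `c_over_n` is an arbitrary value of $C/N$, and the function names are hypothetical. The rule "predict the sign of $\langle \Phi_{\mathbb{R}}(x), w \rangle_{\mathcal{H}} + c$" is the usual SVM decision rule and is assumed here rather than quoted from the text.

```python
def linear_kernel(x, y):
    """A simple real kernel used only for this toy example."""
    return sum(xk * yk for xk, yk in zip(x, y))


def threshold(xs, ds, a, kernel, c_over_n, tol=1e-12):
    """Average c = d_m - sum_n a_n d_n k(x_n, x_m) over 0 < a_m < C/N."""
    vals = []
    for xm, dm, am in zip(xs, ds, a):
        if tol < am < c_over_n - tol:
            s = sum(an * dn * kernel(xn, xm) for xn, dn, an in zip(xs, ds, a))
            vals.append(dm - s)
    return sum(vals) / len(vals)


def decide(x, xs, ds, a, c, kernel):
    """Assumed decision rule: sign of sum_n a_n d_n k(x_n, x) + c."""
    s = sum(an * dn * kernel(xn, x) for xn, dn, an in zip(xs, ds, a)) + c
    return 1 if s >= 0 else -1


# Two separable points on the real line; a = (0.5, 0.5) satisfies
# sum_n a_n d_n = 0 and gives w = sum_n a_n d_n x_n = 1.
xs = [[1.0], [-1.0]]
ds = [1, -1]
a = [0.5, 0.5]
c = threshold(xs, ds, a, linear_kernel, c_over_n=5.0)

print("threshold c:", c)
print("label of 2.0 :", decide([2.0], xs, ds, a, c, linear_kernel))
print("label of -0.3:", decide([-0.3], xs, ds, a, c, linear_kernel))
```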



3.2 Support Vector Regression 

In a more general setting, the outputs $d_n$ may take several values or they may be real
numbers. In the latter case, we are trying to estimate an input-output relationship between
$x_n$ and $d_n$. The SVM rationale can be modified to accommodate this set-up as follows.
Suppose we are given training data of the form $\{(x_n, d_n);\ n = 1, \ldots, N\} \subset X \times \mathbb{R}$, where
$X = \mathbb{R}^\nu$ denotes the space of input patterns. Furthermore, let $\mathcal{H}$ be a real RKHS with
kernel $\kappa_{\mathbb{R}}$. We transform the input data from $X$ to $\mathcal{H}$, via the feature map $\Phi_{\mathbb{R}}$, to obtain
the data $\{(\Phi_{\mathbb{R}}(x_n), d_n);\ n = 1, \ldots, N\}$. In support vector regression, the goal is to find
an affine function $T : \mathcal{H} \to \mathbb{R} : T(f) = \langle f, w \rangle_{\mathcal{H}} + c$, for some $w \in \mathcal{H}$, $c \in \mathbb{R}$, which is as
flat as possible and has at most $\epsilon$ deviation from the actually obtained values $d_n$ for all
$n = 1, \ldots, N$. Observe that at the training points $\Phi_{\mathbb{R}}(x_n)$, $T$ takes the values $T(\Phi_{\mathbb{R}}(x_n))$.
Thus, this is equivalent to finding a non-linear function $g$ defined on $X$ such that

\[ g(x) = T \circ \Phi_{\mathbb{R}}(x) = \langle \Phi_{\mathbb{R}}(x), w \rangle_{\mathcal{H}} + c, \tag{6} \]

for some $w \in \mathcal{H}$, $c \in \mathbb{R}$, which satisfies the aforementioned properties. The usual formulation
of this problem as an optimization task is the following:

\[
\begin{aligned}
\underset{w,\, c}{\text{minimize}} \quad & \frac{1}{2}\|w\|_{\mathcal{H}}^2 + \frac{C}{N} \sum_{n=1}^{N} \left(\xi_n + \hat{\xi}_n\right) \\
\text{subject to} \quad & \langle \Phi_{\mathbb{R}}(x_n), w \rangle_{\mathcal{H}} + c - d_n \leq \epsilon + \xi_n, \\
& d_n - \langle \Phi_{\mathbb{R}}(x_n), w \rangle_{\mathcal{H}} - c \leq \epsilon + \hat{\xi}_n, \\
& \xi_n, \hat{\xi}_n \geq 0,
\end{aligned}
\tag{7}
\]

for $n = 1, \ldots, N$. The constant $C$ determines a tradeoff between the tolerance of the
estimation (i.e., how many larger than $\epsilon$ deviations are tolerated) and the flatness of the solution
(i.e., $T$). This corresponds to the so called $\epsilon$-insensitive loss function $|\xi|_{\epsilon} = \max\{0, |\xi| - \epsilon\}$
(Vapnik, 1999; Theodoridis and Koutroumbas, 2008).
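The $\epsilon$-insensitive loss mentioned above is simple enough to state directly in code. The following Python sketch is a minimal illustration; the function name `eps_insensitive` and the sample residuals are hypothetical.

```python
def eps_insensitive(residual, eps):
    """Epsilon-insensitive loss: max(0, |residual| - eps)."""
    return max(0.0, abs(residual) - eps)


eps = 0.1
# Residuals inside the tube of half-width eps cost nothing;
# outside the tube the cost grows linearly with the excess.
for r in (0.05, -0.1, 0.35, -0.6):
    print(r, eps_insensitive(r, eps))
```

In problem (7) the slack variables $\xi_n$ and $\hat{\xi}_n$ play the role of this loss for positive and negative deviations respectively.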

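To solve (7), one considers the dual problem derived by the Lagrangian:

\[
\begin{aligned}
\underset{a,\, \hat{a}}{\text{maximize}} \quad & -\frac{1}{2} \sum_{n,m=1}^{N} (\hat{a}_n - a_n)(\hat{a}_m - a_m) \kappa_{\mathbb{R}}(x_m, x_n) \\
& - \epsilon \sum_{n=1}^{N} (\hat{a}_n + a_n) + \sum_{n=1}^{N} d_n (\hat{a}_n - a_n) \\
\text{subject to} \quad & \sum_{n=1}^{N} (\hat{a}_n - a_n) = 0 \ \text{ and } \ a_n, \hat{a}_n \in [0, C/N].
\end{aligned}
\tag{8}
\]

Note that $a_n$ and $\hat{a}_n$ are the Lagrange multipliers corresponding to the first two
inequalities of problem (7), for $n = 1, 2, \ldots, N$. Exploiting the saddle point conditions, it can be
proved that $w = \sum_{n=1}^{N} (\hat{a}_n - a_n) \Phi_{\mathbb{R}}(x_n)$ and thus the solution becomes

\[ f(x) = \sum_{n=1}^{N} (\hat{a}_n - a_n) \kappa_{\mathbb{R}}(x_n, x) + c. \tag{9} \]

Furthermore, exploiting the Karush-Kuhn-Tucker (KKT) conditions one may compute the
parameter $c$.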

Several algorithms have been proposed for solving the SVM and SVR tasks, amongst
which are Platt's celebrated Sequential Minimal Optimization (SMO) algorithm (Platt, 1998),
interior point methods (Vanderbei, 1994), geometric algorithms (Mavroforakis and Theodoridis,
2006; Mavroforakis et al., 2007; Lazard, 2011) and methods suitable for large scale
problems (Collobert and Bengio, 2001). A detailed description of the SVR machinery can
be found in Smola and Schölkopf (1998).
4. Complex Support Vector Regression

We begin the treatment of the complex case with the complex SVR rationale, as this is
a direct generalization of the real SVR. Suppose we are given training data of the form
$\{(z_n, d_n);\ n = 1, \ldots, N\} \subset X \times \mathbb{C}$, where $X = \mathbb{C}^\nu$ denotes the space of input patterns. As
$z_n$ is complex, we denote by $x_n$ its real part and by $y_n$ its imaginary part respectively, i.e.,
$z_n = x_n + i y_n$, $n = 1, \ldots, N$. Similarly, we denote by $d_n^r$ and $d_n^i$ the real and the imaginary
part of $d_n$, i.e., $d_n = d_n^r + i d_n^i$, $n = 1, \ldots, N$.



4.1 Dual Channel SVR 

A straightforward approach for addressing this problem (as well as any problem related to
complex data) is to consider two different problems in the real domain. This technique
is usually referred to as the Dual Real Channel (DRC) approach. That is, the training data
are split into two sets $\{((x_n, y_n)^T, d_n^r);\ n = 1, \ldots, N\} \subset \mathbb{R}^{2\nu} \times \mathbb{R}$ and
$\{((x_n, y_n)^T, d_n^i);\ n = 1, \ldots, N\} \subset \mathbb{R}^{2\nu} \times \mathbb{R}$, and support vector regression is performed on each set of data using a
real kernel $\kappa_{\mathbb{R}}$ and its corresponding RKHS. We will show in the following sections that the
DRC approach is equivalent to the complexification procedure (Bouboulis and Theodoridis,
2011) described in Section 2. The latter, however, often provides a context that enables
us to work with complex data in a more compact form, as one may employ Wirtinger's
Calculus to compute the respective gradients and develop algorithms directly in complex
form (Bouboulis and Theodoridis, 2011).
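The data split behind the DRC approach can be sketched as follows in Python. The function name `split_dual_real_channel` and the sample values are hypothetical; the sketch only builds the two real training sets and leaves the actual regression to whatever real SVR solver is used.

```python
def split_dual_real_channel(zs, ds):
    """Turn complex samples (z_n, d_n) into two real regression sets.

    Each input z_n = x_n + i y_n in C^nu becomes the real vector (x_n, y_n)
    in R^(2 nu); the first set keeps Re(d_n), the second keeps Im(d_n).
    """
    inputs = [[c.real for c in z] + [c.imag for c in z] for z in zs]
    real_set = [(x, d.real) for x, d in zip(inputs, ds)]
    imag_set = [(x, d.imag) for x, d in zip(inputs, ds)]
    return real_set, imag_set


# Hypothetical samples with nu = 2.
zs = [[1 + 2j, 0.5 - 1j], [-1j, 3 + 0j]]
ds = [2 - 1j, 0.5 + 4j]

real_set, imag_set = split_dual_real_channel(zs, ds)
print(real_set)
print(imag_set)
```

Both channels share the same real inputs, so a single real kernel matrix could serve the two regression tasks.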



In contrast to the complexification procedure, we emphasize that the pure complex
approach (where one directly exploits a complex RKHS) considered in the next section
is quite different from the DRC rationale. We will develop a framework for solving such
a problem on the complex domain employing pure complex kernels, instead of real ones.
Nevertheless, we will show that using complex kernels for SVR is equivalent to solving
two real problems using a real kernel. This kernel, however, is induced by the selected
complex kernel and it is not one of the standard kernels appearing in the machine learning
literature. For example, the use of the complex Gaussian kernel induces a real kernel, which
is not the standard real Gaussian RBF (see figure 1). As has already been demonstrated
(Bouboulis et al., 2012a,b), although in a different context than the one we use here,
the DRC approach and the pure complex approaches give, in general, different results.
Depending on the case, the pure complex approach might show increased performance over
the DRC approach and vice versa.



4.2 Pure Complex SVR 

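Prior to the development of the generalized complex SVR rationale, we investigate some
significant properties of the complex kernels. In the following, we assume that $\mathbb{H}$ is a
complex RKHS with kernel $\kappa_{\mathbb{C}}$. We can decompose $\kappa_{\mathbb{C}}$ into its real and imaginary parts,
i.e., $\kappa_{\mathbb{C}}(z, z') = \kappa_{\mathbb{C}}^r(z, z') + i \kappa_{\mathbb{C}}^i(z, z')$, where $\kappa_{\mathbb{C}}^r(z, z'), \kappa_{\mathbb{C}}^i(z, z') \in \mathbb{R}$. As any complex
kernel is Hermitian (see Section 2), we have that $\kappa_{\mathbb{C}}^*(z, z') = \kappa_{\mathbb{C}}(z', z)$ and hence we take

\[ \kappa_{\mathbb{C}}^r(z, z') = \kappa_{\mathbb{C}}^r(z', z), \tag{10} \]

\[ \kappa_{\mathbb{C}}^i(z, z') = -\kappa_{\mathbb{C}}^i(z', z). \tag{11} \]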
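Lemma 1 The imaginary part of any complex kernel, $\kappa_{\mathbb{C}}$, satisfies:

\[ \sum_{n,m=1}^{N} c_n c_m \kappa_{\mathbb{C}}^i(z_n, z_m) = 0, \tag{12} \]

for any $N > 0$ and any selection of $c_1, \ldots, c_N \in \mathbb{C}$ and $z_1, \ldots, z_N \in X$.

Proof Exploiting equation (11) and rearranging the indices of the summation we get:

\[ \sum_{n,m=1}^{N} c_n c_m \kappa_{\mathbb{C}}^i(z_n, z_m) = -\sum_{n,m=1}^{N} c_n c_m \kappa_{\mathbb{C}}^i(z_m, z_n) = -\sum_{m,n=1}^{N} c_m c_n \kappa_{\mathbb{C}}^i(z_n, z_m). \]

Hence, $2 \sum_{n,m=1}^{N} c_n c_m \kappa_{\mathbb{C}}^i(z_n, z_m) = 0$ and the result follows immediately.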



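Lemma 2 If $\kappa_{\mathbb{C}}(z, z')$ is a complex kernel defined on $\mathbb{C}^\nu \times \mathbb{C}^\nu$, then its real part, i.e.,

\[ \kappa_{\mathbb{C}}^r\left( \begin{pmatrix} x \\ y \end{pmatrix}, \begin{pmatrix} x' \\ y' \end{pmatrix} \right) = \mathrm{Re}\left(\kappa_{\mathbb{C}}(z, z')\right), \tag{13} \]

where $z = x + iy$, $z' = x' + iy'$, is a real kernel defined on $\mathbb{R}^{2\nu} \times \mathbb{R}^{2\nu}$. We call this kernel
the induced real kernel of $\kappa_{\mathbb{C}}$.

Proof As relation (10) implies, $\kappa_{\mathbb{C}}^r$ is symmetric. Moreover, let $N > 0$, $a_1, \ldots, a_N \in \mathbb{R}$
and $z_1, \ldots, z_N \in X$. As $\kappa_{\mathbb{C}}$ is positive definite, we have that

\[ \sum_{n,m=1}^{N} a_n a_m \kappa_{\mathbb{C}}(z_n, z_m) \geq 0. \]

However, splitting $\kappa_{\mathbb{C}}$ into its real and imaginary parts and exploiting Lemma 1 we take

\[ \sum_{n,m=1}^{N} a_n a_m \kappa_{\mathbb{C}}(z_n, z_m) = \sum_{n,m=1}^{N} a_n a_m \kappa_{\mathbb{C}}^r(z_n, z_m) + i \sum_{n,m=1}^{N} a_n a_m \kappa_{\mathbb{C}}^i(z_n, z_m) = \sum_{n,m=1}^{N} a_n a_m \kappa_{\mathbb{C}}^r(z_n, z_m). \]

Hence, $\sum_{n,m=1}^{N} a_n a_m \kappa_{\mathbb{C}}^r(z_n, z_m) \geq 0$. As a last step, recall that $\kappa_{\mathbb{C}}^r$ may be regarded as defined
either on $\mathbb{C}^\nu \times \mathbb{C}^\nu$ or $\mathbb{R}^{2\nu} \times \mathbb{R}^{2\nu}$. This leads to

\[ \sum_{n,m=1}^{N} a_n a_m \kappa_{\mathbb{C}}^r\left( \begin{pmatrix} x_n \\ y_n \end{pmatrix}, \begin{pmatrix} x_m \\ y_m \end{pmatrix} \right) \geq 0. \]

We conclude that $\kappa_{\mathbb{C}}^r$ is a positive definite kernel on $\mathbb{R}^{2\nu} \times \mathbb{R}^{2\nu}$.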
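Lemmas 1 and 2 can be checked numerically on random data. The Python sketch below does so for the complex Gaussian kernel; the sample size, the parameter t = 0.5, the random seed and all names are illustrative assumptions. A numerical check on one data set is of course no substitute for the proofs.

```python
import cmath
import random


def complex_gaussian(z, w, t=0.5):
    """Complex Gaussian kernel: exp(-t * sum_k (z_k - conj(w_k))^2)."""
    s = sum((zk - wk.conjugate()) ** 2 for zk, wk in zip(z, w))
    return cmath.exp(-t * s)


random.seed(0)
N, nu = 6, 2
zs = [[complex(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(nu)]
      for _ in range(N)]
a = [random.uniform(-1, 1) for _ in range(N)]

gram = [[complex_gaussian(zs[n], zs[m]) for m in range(N)] for n in range(N)]

# Lemma 1: the imaginary part is antisymmetric, so the double sum vanishes.
imag_sum = sum(a[n] * a[m] * gram[n][m].imag
               for n in range(N) for m in range(N))

# Lemma 2: the quadratic form of the real part is non-negative.
real_sum = sum(a[n] * a[m] * gram[n][m].real
               for n in range(N) for m in range(N))

print("sum with imaginary part:", imag_sum)
print("sum with real part     :", real_sum)
```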



At this point we are ready to present the SVR rationale in complex RKHS. We transform
the input data from $X$ to $\mathbb{H}$, via the feature map $\Phi_{\mathbb{C}}$, to obtain the data $\{(\Phi_{\mathbb{C}}(z_n), d_n);\ n =
1, \ldots, N\}$. In analogy with the real case and extending the principles of widely linear
estimation to complex support vector regression, the goal is to find a function $T : \mathbb{H} \to \mathbb{C}$:
$T(f) = \langle f, w \rangle_{\mathbb{H}} + \langle f^*, v \rangle_{\mathbb{H}} + c$, for some $w, v \in \mathbb{H}$, $c \in \mathbb{C}$, which is as flat as possible and
has at most $\epsilon$ deviation from both the real and imaginary parts of the actually obtained
values $d_n$, for all $n = 1, \ldots, N$. We emphasize that we employ the widely linear estimation
function $S_1 : \mathbb{H} \to \mathbb{C} : S_1(f) = \langle f, w \rangle_{\mathbb{H}} + \langle f^*, v \rangle_{\mathbb{H}}$ instead of the usual complex linear
function $S_2 : \mathbb{H} \to \mathbb{C} : S_2(f) = \langle f, w \rangle_{\mathbb{H}}$, following the ideas of Picinbono and Chevalier
(1995), which are becoming popular in complex signal processing (Took and Mandic, 2010;
Chevalier and Pipon, 2006; Jeon et al., 2006; Cacciapuoti et al., 2008) and have been
generalized for the case of complex RKHS in Bouboulis et al. (2012a). It has been
established (Picinbono, 1994; Picinbono and Bondon, 1997) that the widely linear estimation
functions are able to capture the second order statistical characteristics of the input data,
which are necessary if non-circular input sources (see footnote 1) are considered. Furthermore,
as has been shown in Bouboulis et al. (2012b), the exploitation of the traditional complex linear
function excludes a significant percentage of linear functions from being considered in the
estimation process. The correct and natural linear estimation in complex spaces is the
widely linear one.
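A one-dimensional toy case makes the restriction of the strictly linear model tangible. In the Python sketch below the Hilbert space is simply $\mathbb{C}$ with inner product $\langle f, w \rangle = f w^*$, which is an assumption made for illustration only; the names `inner`, `widely_linear` and `strictly_linear` are hypothetical. The target map $f \mapsto \mathrm{Re}(f)$ is linear over the reals, yet only the widely linear form reproduces it.

```python
def inner(f, w):
    """Inner product on C, linear in the first argument."""
    return f * w.conjugate()


def widely_linear(f, w, v):
    """S1(f) = <f, w> + <f*, v>."""
    return inner(f, w) + inner(f.conjugate(), v)


def strictly_linear(f, w):
    """S2(f) = <f, w>."""
    return inner(f, w)


points = [1 + 0j, 1j, 2 - 3j, -0.5 + 0.25j]

# Re(f) = f/2 + f*/2, so w = v = 1/2 reproduces the target exactly.
w = v = 0.5 + 0j
widely_errors = [abs(widely_linear(f, w, v) - f.real) for f in points]

# Matching the target at f = 1 forces w = 1 in the strictly linear model,
# which then returns f itself and misses Re(f) whenever Im(f) != 0.
strict_errors = [abs(strictly_linear(f, 1 + 0j) - f.real) for f in points]

print("widely linear errors  :", widely_errors)
print("strictly linear errors:", strict_errors)
```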

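Observe that at the training points $\Phi_{\mathbb{C}}(z_n)$, $T$ takes the values $T(\Phi_{\mathbb{C}}(z_n))$. Following
similar arguments as with the real case, this is equivalent to finding a complex non-linear
function $g$ defined on $X$ such that

\[ g(z) = T \circ \Phi_{\mathbb{C}}(z) = \langle \Phi_{\mathbb{C}}(z), w \rangle_{\mathbb{H}} + \langle \Phi_{\mathbb{C}}^*(z), v \rangle_{\mathbb{H}} + c, \tag{14} \]

for some $w, v \in \mathbb{H}$, $c \in \mathbb{C}$, which satisfies the aforementioned properties. We formulate the
complex support vector regression task as follows:

\[
\begin{aligned}
\min_{w,\, v,\, c} \quad & \frac{1}{2}\|w\|_{\mathbb{H}}^2 + \frac{1}{2}\|v\|_{\mathbb{H}}^2 + \frac{C}{N} \sum_{n=1}^{N} \left(\xi_n^r + \hat{\xi}_n^r + \xi_n^i + \hat{\xi}_n^i\right) \\
\text{s.t.} \quad & \mathrm{Re}\left(\langle \Phi_{\mathbb{C}}(z_n), w \rangle_{\mathbb{H}} + \langle \Phi_{\mathbb{C}}^*(z_n), v \rangle_{\mathbb{H}} + c - d_n\right) \leq \epsilon + \xi_n^r, \\
& \mathrm{Re}\left(d_n - \langle \Phi_{\mathbb{C}}(z_n), w \rangle_{\mathbb{H}} - \langle \Phi_{\mathbb{C}}^*(z_n), v \rangle_{\mathbb{H}} - c\right) \leq \epsilon + \hat{\xi}_n^r, \\
& \mathrm{Im}\left(\langle \Phi_{\mathbb{C}}(z_n), w \rangle_{\mathbb{H}} + \langle \Phi_{\mathbb{C}}^*(z_n), v \rangle_{\mathbb{H}} + c - d_n\right) \leq \epsilon + \xi_n^i, \\
& \mathrm{Im}\left(d_n - \langle \Phi_{\mathbb{C}}(z_n), w \rangle_{\mathbb{H}} - \langle \Phi_{\mathbb{C}}^*(z_n), v \rangle_{\mathbb{H}} - c\right) \leq \epsilon + \hat{\xi}_n^i, \\
& \xi_n^r, \hat{\xi}_n^r, \xi_n^i, \hat{\xi}_n^i \geq 0, \quad \text{for } n = 1, \ldots, N.
\end{aligned}
\tag{15}
\]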



1. Note that the issue of circularity has become quite popular recently in the context of complex adaptive
filtering. Circularity is intimately related to rotation in the geometric sense. A complex random variable
$Z$ is called circular, if for any angle $\phi$ both $Z$ and $Z e^{i\phi}$ (i.e., the rotation of $Z$ by angle $\phi$) follow the
same probability distribution (Mandic and Goh, 2009).
[Figure 2, block diagram: the pure complex SVR with complex kernel $\kappa_{\mathbb{C}}$ is split into two real SVR
tasks that use the induced real kernel $\kappa_{\mathbb{C}}^r$. The first outputs $a_n, \hat{a}_n, c^r$, the second outputs
$b_n, \hat{b}_n, c^i$, and the two solutions are then combined.]

Figure 2: Pure Complex Support Vector Regression. The difference with the dual channel
approach is due to the incorporation of the induced real kernel $\kappa_{\mathbb{C}}^r$, which depends
on the selection of the complex kernel $\kappa_{\mathbb{C}}$. In this context one exploits the complex
structure of the space, which is lost in the dual channel approach.
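To solve (15), we derive the Lagrangian and the KKT conditions to obtain the dual
problem. Thus we take:

\[
\begin{aligned}
\mathcal{L} ={} & \frac{1}{2}\|w\|_{\mathbb{H}}^2 + \frac{1}{2}\|v\|_{\mathbb{H}}^2 + \frac{C}{N} \sum_{n=1}^{N} \left(\xi_n^r + \hat{\xi}_n^r + \xi_n^i + \hat{\xi}_n^i\right) \\
& + \sum_{n=1}^{N} a_n \left(\mathrm{Re}\left(\langle \Phi_{\mathbb{C}}(z_n), w \rangle_{\mathbb{H}} + \langle \Phi_{\mathbb{C}}^*(z_n), v \rangle_{\mathbb{H}} + c - d_n\right) - \epsilon - \xi_n^r\right) \\
& + \sum_{n=1}^{N} \hat{a}_n \left(\mathrm{Re}\left(d_n - \langle \Phi_{\mathbb{C}}(z_n), w \rangle_{\mathbb{H}} - \langle \Phi_{\mathbb{C}}^*(z_n), v \rangle_{\mathbb{H}} - c\right) - \epsilon - \hat{\xi}_n^r\right) \\
& + \sum_{n=1}^{N} b_n \left(\mathrm{Im}\left(\langle \Phi_{\mathbb{C}}(z_n), w \rangle_{\mathbb{H}} + \langle \Phi_{\mathbb{C}}^*(z_n), v \rangle_{\mathbb{H}} + c - d_n\right) - \epsilon - \xi_n^i\right) \\
& + \sum_{n=1}^{N} \hat{b}_n \left(\mathrm{Im}\left(d_n - \langle \Phi_{\mathbb{C}}(z_n), w \rangle_{\mathbb{H}} - \langle \Phi_{\mathbb{C}}^*(z_n), v \rangle_{\mathbb{H}} - c\right) - \epsilon - \hat{\xi}_n^i\right) \\
& - \sum_{n=1}^{N} \eta_n \xi_n^r - \sum_{n=1}^{N} \hat{\eta}_n \hat{\xi}_n^r - \sum_{n=1}^{N} \theta_n \xi_n^i - \sum_{n=1}^{N} \hat{\theta}_n \hat{\xi}_n^i,
\end{aligned}
\tag{16}
\]

where $a_n$, $\hat{a}_n$, $b_n$, $\hat{b}_n$, $\eta_n$, $\hat{\eta}_n$, $\theta_n$, $\hat{\theta}_n$ are the Lagrange multipliers. To exploit the saddle
point conditions, we employ the rules of Wirtinger's Calculus for the complex variables on
complex RKHS's as described in Bouboulis and Theodoridis (2011) and deduce that

\[
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial w^*} &= \frac{1}{2} w + \frac{1}{2} \sum_{n=1}^{N} a_n \Phi_{\mathbb{C}}(z_n) - \frac{1}{2} \sum_{n=1}^{N} \hat{a}_n \Phi_{\mathbb{C}}(z_n) - \frac{i}{2} \sum_{n=1}^{N} b_n \Phi_{\mathbb{C}}(z_n) + \frac{i}{2} \sum_{n=1}^{N} \hat{b}_n \Phi_{\mathbb{C}}(z_n), \\
\frac{\partial \mathcal{L}}{\partial v^*} &= \frac{1}{2} v + \frac{1}{2} \sum_{n=1}^{N} a_n \Phi_{\mathbb{C}}^*(z_n) - \frac{1}{2} \sum_{n=1}^{N} \hat{a}_n \Phi_{\mathbb{C}}^*(z_n) - \frac{i}{2} \sum_{n=1}^{N} b_n \Phi_{\mathbb{C}}^*(z_n) + \frac{i}{2} \sum_{n=1}^{N} \hat{b}_n \Phi_{\mathbb{C}}^*(z_n), \\
\frac{\partial \mathcal{L}}{\partial c^*} &= \frac{1}{2} \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \hat{a}_n + \frac{i}{2} \sum_{n=1}^{N} b_n - \frac{i}{2} \sum_{n=1}^{N} \hat{b}_n.
\end{aligned}
\]

For the real variables we compute the gradients in the traditional way:

\[
\frac{\partial \mathcal{L}}{\partial \xi_n^r} = \frac{C}{N} - a_n - \eta_n, \quad
\frac{\partial \mathcal{L}}{\partial \hat{\xi}_n^r} = \frac{C}{N} - \hat{a}_n - \hat{\eta}_n, \quad
\frac{\partial \mathcal{L}}{\partial \xi_n^i} = \frac{C}{N} - b_n - \theta_n, \quad
\frac{\partial \mathcal{L}}{\partial \hat{\xi}_n^i} = \frac{C}{N} - \hat{b}_n - \hat{\theta}_n,
\]

for all $n = 1, \ldots, N$.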

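As all gradients have to vanish for the saddle point conditions, we finally take that

\[ w = \sum_{n=1}^{N} (\hat{a}_n - a_n) \Phi_{\mathbb{C}}(z_n) - i \sum_{n=1}^{N} (\hat{b}_n - b_n) \Phi_{\mathbb{C}}(z_n), \tag{17} \]

\[ v = \sum_{n=1}^{N} (\hat{a}_n - a_n) \Phi_{\mathbb{C}}^*(z_n) - i \sum_{n=1}^{N} (\hat{b}_n - b_n) \Phi_{\mathbb{C}}^*(z_n), \tag{18} \]

\[ \sum_{n=1}^{N} (\hat{a}_n - a_n) = \sum_{n=1}^{N} (\hat{b}_n - b_n) = 0, \tag{19} \]

\[ \eta_n = \frac{C}{N} - a_n, \quad \hat{\eta}_n = \frac{C}{N} - \hat{a}_n, \quad \theta_n = \frac{C}{N} - b_n, \quad \hat{\theta}_n = \frac{C}{N} - \hat{b}_n, \tag{20} \]

for $n = 1, \ldots, N$.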

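To compute $\|w\|_{\mathbb{H}}^2 = \langle w, w \rangle_{\mathbb{H}}$, we apply equation (17), Lemma 1, the reproducing
property of $\mathbb{H}$, i.e., $\langle \Phi_{\mathbb{C}}(z_n), \Phi_{\mathbb{C}}(z_m) \rangle_{\mathbb{H}} = \kappa_{\mathbb{C}}(z_m, z_n)$, and the sesqui-linear property of the inner
product of $\mathbb{H}$ to obtain that:

\[
\begin{aligned}
\|w\|_{\mathbb{H}}^2 ={} & \sum_{n,m=1}^{N} (\hat{a}_n - a_n)(\hat{a}_m - a_m) \kappa_{\mathbb{C}}^r(z_m, z_n) + \sum_{n,m=1}^{N} (\hat{b}_n - b_n)(\hat{b}_m - b_m) \kappa_{\mathbb{C}}^r(z_m, z_n) \\
& + 2 \sum_{n,m=1}^{N} (\hat{a}_n - a_n)(\hat{b}_m - b_m) \kappa_{\mathbb{C}}^i(z_n, z_m).
\end{aligned}
\]

Similarly, we have

\[
\begin{aligned}
\|v\|_{\mathbb{H}}^2 ={} & \sum_{n,m=1}^{N} (\hat{a}_n - a_n)(\hat{a}_m - a_m) \kappa_{\mathbb{C}}^r(z_m, z_n) + \sum_{n,m=1}^{N} (\hat{b}_n - b_n)(\hat{b}_m - b_m) \kappa_{\mathbb{C}}^r(z_m, z_n) \\
& - 2 \sum_{n,m=1}^{N} (\hat{a}_n - a_n)(\hat{b}_m - b_m) \kappa_{\mathbb{C}}^i(z_n, z_m),
\end{aligned}
\]

and

\[ \langle \Phi_{\mathbb{C}}(z_n), w \rangle_{\mathbb{H}} + \langle \Phi_{\mathbb{C}}^*(z_n), v \rangle_{\mathbb{H}} = 2 \sum_{m=1}^{N} (\hat{a}_m - a_m) \kappa_{\mathbb{C}}^r(z_m, z_n) + 2i \sum_{m=1}^{N} (\hat{b}_m - b_m) \kappa_{\mathbb{C}}^r(z_m, z_n). \]

Note that the cross terms of $\|w\|_{\mathbb{H}}^2$ and $\|v\|_{\mathbb{H}}^2$ cancel in the sum $\|w\|_{\mathbb{H}}^2 + \|v\|_{\mathbb{H}}^2$, so that the
imaginary part of the kernel does not appear in the dual.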



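Eliminating $\eta_n$, $\hat{\eta}_n$, $\theta_n$, $\hat{\theta}_n$ via (20) and $w$, $v$ via the aforementioned relations, we obtain
the final form of the Lagrangian:

\[
\begin{aligned}
\mathcal{L} ={} & -\sum_{n,m=1}^{N} (\hat{a}_n - a_n)(\hat{a}_m - a_m) \kappa_{\mathbb{C}}^r(z_m, z_n) - \sum_{n,m=1}^{N} (\hat{b}_n - b_n)(\hat{b}_m - b_m) \kappa_{\mathbb{C}}^r(z_m, z_n) \\
& - \epsilon \sum_{n=1}^{N} \left(\hat{a}_n + a_n + \hat{b}_n + b_n\right) + \sum_{n=1}^{N} d_n^r (\hat{a}_n - a_n) + \sum_{n=1}^{N} d_n^i (\hat{b}_n - b_n),
\end{aligned}
\tag{21}
\]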
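where $d_n^r$, $d_n^i$ are the real and imaginary parts of the output $d_n$, $n = 1, \ldots, N$. This means
that we can split the dual problem into two separate maximization tasks:

\[
\begin{aligned}
\underset{a,\, \hat{a}}{\text{maximize}} \quad & -\sum_{n,m=1}^{N} (\hat{a}_n - a_n)(\hat{a}_m - a_m) \kappa_{\mathbb{C}}^r(z_m, z_n) - \epsilon \sum_{n=1}^{N} (\hat{a}_n + a_n) + \sum_{n=1}^{N} d_n^r (\hat{a}_n - a_n) \\
\text{subject to} \quad & \sum_{n=1}^{N} (\hat{a}_n - a_n) = 0 \ \text{ and } \ a_n, \hat{a}_n \in [0, C/N],
\end{aligned}
\tag{22a}
\]

and

\[
\begin{aligned}
\underset{b,\, \hat{b}}{\text{maximize}} \quad & -\sum_{n,m=1}^{N} (\hat{b}_n - b_n)(\hat{b}_m - b_m) \kappa_{\mathbb{C}}^r(z_m, z_n) - \epsilon \sum_{n=1}^{N} (\hat{b}_n + b_n) + \sum_{n=1}^{N} d_n^i (\hat{b}_n - b_n) \\
\text{subject to} \quad & \sum_{n=1}^{N} (\hat{b}_n - b_n) = 0 \ \text{ and } \ b_n, \hat{b}_n \in [0, C/N].
\end{aligned}
\tag{22b}
\]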



Observe that (22a) and (22b) are equivalent to the dual problem of a standard real
support vector regression task with kernel $2\kappa_{\mathbb{C}}^r$. This is a real kernel, as Lemma 2 establishes.
Therefore (figure 2), one may solve the two real SVR tasks for $a_n$, $\hat{a}_n$, $c^r$ and $b_n$, $\hat{b}_n$, $c^i$
respectively, using any one of the algorithms which have been developed for this purpose,
and then combine the two solutions to find the final non-linear solution of the complex
problem as

\[ g(z) = 2 \sum_{n=1}^{N} (\hat{a}_n - a_n) \kappa_{\mathbb{C}}^r(z_n, z) + 2i \sum_{n=1}^{N} (\hat{b}_n - b_n) \kappa_{\mathbb{C}}^r(z_n, z) + c^r + i c^i. \tag{23} \]
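The combination step in (23) amounts to evaluating two real kernel expansions and gluing them into one complex value. The Python sketch below shows this for the complex Gaussian kernel. The multipliers and thresholds are hand-picked placeholders that merely satisfy the equality constraints of (22a) and (22b); they are not the output of any solver, and the function names are hypothetical.

```python
import cmath


def complex_gaussian(z, w, t=1.0):
    """Complex Gaussian kernel: exp(-t * sum_k (z_k - conj(w_k))^2)."""
    s = sum((zk - wk.conjugate()) ** 2 for zk, wk in zip(z, w))
    return cmath.exp(-t * s)


def induced_real_kernel(z, w, t=1.0):
    """Real part of the complex kernel (the induced real kernel)."""
    return complex_gaussian(z, w, t).real


def combine(z, zs, a, a_hat, b, b_hat, c_r, c_i, t=1.0):
    """Evaluate g(z) as in (23) from the two real SVR solutions."""
    re = 2 * sum((ah - al) * induced_real_kernel(zn, z, t)
                 for zn, al, ah in zip(zs, a, a_hat)) + c_r
    im = 2 * sum((bh - bl) * induced_real_kernel(zn, z, t)
                 for zn, bl, bh in zip(zs, b, b_hat)) + c_i
    return complex(re, im)


# Placeholder training inputs and multipliers (nu = 1, N = 2).
zs = [[0j], [1 + 0j]]
a, a_hat = [0.0, 0.5], [0.5, 0.0]    # sum(a_hat - a) = 0
b, b_hat = [0.25, 0.0], [0.0, 0.25]  # sum(b_hat - b) = 0
c_r, c_i = 1.0, -1.0

g0 = combine([0j], zs, a, a_hat, b, b_hat, c_r, c_i)
print("g(0) =", g0)
```

Since both tasks use the same kernel matrix, an implementation could compute the values of $\kappa_{\mathbb{C}}^r$ once and reuse them for the real and the imaginary channel.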



In this paper we are focusing mainly on the complex Gaussian kernel. It is important to
emphasize that, in this case, the induced kernel is not the real Gaussian RBF. Figure 1
shows the element $\Phi\left((0, 0)^T\right)$ of the induced real feature space.

Remark 3 For the complexification procedure, we select a real kernel $\kappa_{\mathbb{R}}$ and transform the
input data from $\mathcal{X}$ to the complexified space $\mathbb{H}$, via the feature map $\Phi_{\mathbb{C}}$, to obtain the data
$\{(\Phi_{\mathbb{C}}(z_n), d_n);\ n = 1, \dots, N\}$. Following a similar procedure as the one described above and
considering that

$$\langle \Phi_{\mathbb{C}}(z_n), \Phi_{\mathbb{C}}(z_m) \rangle_{\mathbb{H}} = 2\kappa_{\mathbb{R}}(z_m, z_n),$$

we can easily deduce that the dual of the complexified SVR task is equivalent to two real
SVR tasks employing the kernel $2\kappa_{\mathbb{R}}$. Hence, the complexification technique is identical to
the DRC approach.

5. Complex Support Vector Machines

Recall that in any real Hilbert space $\mathcal{H}$, a hyperplane consists of all the elements $f \in \mathcal{H}$
that satisfy

$$\langle f, w \rangle_{\mathcal{H}} + b = 0, \tag{24}$$

for some $w \in \mathcal{H}$, $b \in \mathbb{R}$. Moreover, as figure 3 shows, any hyperplane of $\mathcal{H}$ divides the space
into two parts, $\mathcal{H}_{+} = \{f \in \mathcal{H};\ \langle f, w \rangle_{\mathcal{H}} + b > 0\}$ and $\mathcal{H}_{-} = \{f \in \mathcal{H};\ \langle f, w \rangle_{\mathcal{H}} + b < 0\}$.
In the traditional SVM classification task, which has been outlined in section 3, the goal
is to separate two distinct classes of data by a maximum margin hyperplane, so that one
class falls into $\mathcal{H}_{+}$ and the other into $\mathcal{H}_{-}$ (excluding some outliers). In order to
generalize the SVM rationale to complex spaces, we first need to determine an appropriate
definition for a complex hyperplane. The difficulty is that the set of complex numbers is
not an ordered one, and thus one may not assume that a complex version of (24) divides
the space into two parts, as $\mathcal{H}_{+}$ and $\mathcal{H}_{-}$ cannot be defined. Instead, we will provide a novel
definition of complex hyperplanes that divide the complex space into four parts. This will
be our starting point for deriving the complex SVM rationale, which classifies objects into
four (instead of two) classes.

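Lemma 4 The relations

$$\operatorname{Re}\left(\langle f, w \rangle_{\mathbb{H}} + c\right) = 0, \tag{25a}$$
$$\operatorname{Im}\left(\langle f, w \rangle_{\mathbb{H}} + c\right) = 0, \tag{25b}$$

for some $w \in \mathbb{H}$, $c \in \mathbb{C}$, where $f \in \mathbb{H}$, represent two orthogonal hyperplanes of the doubled
real space, i.e., $\mathcal{H}^2$, in general positions.

Proof Observe that

$$\langle f, w \rangle_{\mathbb{H}} = \langle f^{r}, w^{r} \rangle_{\mathcal{H}} + \langle f^{i}, w^{i} \rangle_{\mathcal{H}} + i\left(\langle f^{i}, w^{r} \rangle_{\mathcal{H}} - \langle f^{r}, w^{i} \rangle_{\mathcal{H}}\right),$$

where $f = f^{r} + i f^{i}$, $w = w^{r} + i w^{i}$. Hence, we take that

$$\left\langle \begin{pmatrix} f^{r} \\ f^{i} \end{pmatrix}, \begin{pmatrix} w^{r} \\ w^{i} \end{pmatrix} \right\rangle_{\mathcal{H}^2} + c^{r} = 0 \quad \text{and} \quad \left\langle \begin{pmatrix} f^{r} \\ f^{i} \end{pmatrix}, \begin{pmatrix} -w^{i} \\ w^{r} \end{pmatrix} \right\rangle_{\mathcal{H}^2} + c^{i} = 0,$$

where $c = c^{r} + i c^{i}$. These are two distinct hyperplanes of $\mathcal{H}^2$. Moreover, as

$$\left\langle \begin{pmatrix} w^{r} \\ w^{i} \end{pmatrix}, \begin{pmatrix} -w^{i} \\ w^{r} \end{pmatrix} \right\rangle_{\mathcal{H}^2} = 0,$$

the two hyperplanes are orthogonal. ■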



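Lemma 5 The relations

$$\operatorname{Re}\left(\langle f, w \rangle_{\mathbb{H}} + \langle f^{*}, v \rangle_{\mathbb{H}} + c\right) = 0, \tag{26a}$$
$$\operatorname{Im}\left(\langle f, w \rangle_{\mathbb{H}} + \langle f^{*}, v \rangle_{\mathbb{H}} + c\right) = 0, \tag{26b}$$

for some $w, v \in \mathbb{H}$, $c \in \mathbb{C}$, where $f \in \mathbb{H}$, represent two hyperplanes of the doubled real space,
i.e., $\mathcal{H}^2$. Depending on the values of $w, v$, these hyperplanes may be placed arbitrarily on
$\mathcal{H}^2$.

Proof Following a similar rationale as in the proof of lemma 4, we finally take

$$\left\langle \begin{pmatrix} f^{r} \\ f^{i} \end{pmatrix}, \begin{pmatrix} w^{r} + v^{r} \\ w^{i} - v^{i} \end{pmatrix} \right\rangle_{\mathcal{H}^2} + c^{r} = 0$$

and

$$\left\langle \begin{pmatrix} f^{r} \\ f^{i} \end{pmatrix}, \begin{pmatrix} -(w^{i} + v^{i}) \\ w^{r} - v^{r} \end{pmatrix} \right\rangle_{\mathcal{H}^2} + c^{i} = 0,$$

where $f = f^{r} + i f^{i}$, $w = w^{r} + i w^{i}$, $v = v^{r} + i v^{i}$, $c = c^{r} + i c^{i}$. ■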

The following definition comes naturally. 

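Definition 6 Let $\mathbb{H}$ be a complex Hilbert space. We define the complex couple of hyper-
planes as the set of all $f \in \mathbb{H}$ that satisfy one of the following relations

$$\operatorname{Re}\left(\langle f, w \rangle_{\mathbb{H}} + \langle f^{*}, v \rangle_{\mathbb{H}} + c\right) = 0, \tag{27a}$$
$$\operatorname{Im}\left(\langle f, w \rangle_{\mathbb{H}} + \langle f^{*}, v \rangle_{\mathbb{H}} + c\right) = 0, \tag{27b}$$

for some $w, v \in \mathbb{H}$, $c \in \mathbb{C}$.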

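Lemmas 4 and 5 demonstrate the significant difference between complex linear estima-
tion and widely linear estimation functions, which has already been pointed out in section
4.2, albeit in a different context. The complex linear case is quite restrictive, as the couple
of complex hyperplanes are always orthogonal. On the other hand, the widely linear case
is more general and covers all cases. The complex couple of hyperplanes (as defined by
definition 6) divides the space into four parts, i.e.,

$$\begin{aligned}
\mathbb{H}_{++} &= \left\{ f \in \mathbb{H};\ \operatorname{Re}\left(\langle f, w \rangle_{\mathbb{H}} + \langle f^{*}, v \rangle_{\mathbb{H}} + c\right) > 0,\ \operatorname{Im}\left(\langle f, w \rangle_{\mathbb{H}} + \langle f^{*}, v \rangle_{\mathbb{H}} + c\right) > 0 \right\}, \\
\mathbb{H}_{+-} &= \left\{ f \in \mathbb{H};\ \operatorname{Re}\left(\langle f, w \rangle_{\mathbb{H}} + \langle f^{*}, v \rangle_{\mathbb{H}} + c\right) > 0,\ \operatorname{Im}\left(\langle f, w \rangle_{\mathbb{H}} + \langle f^{*}, v \rangle_{\mathbb{H}} + c\right) < 0 \right\}, \\
\mathbb{H}_{-+} &= \left\{ f \in \mathbb{H};\ \operatorname{Re}\left(\langle f, w \rangle_{\mathbb{H}} + \langle f^{*}, v \rangle_{\mathbb{H}} + c\right) < 0,\ \operatorname{Im}\left(\langle f, w \rangle_{\mathbb{H}} + \langle f^{*}, v \rangle_{\mathbb{H}} + c\right) > 0 \right\}, \\
\mathbb{H}_{--} &= \left\{ f \in \mathbb{H};\ \operatorname{Re}\left(\langle f, w \rangle_{\mathbb{H}} + \langle f^{*}, v \rangle_{\mathbb{H}} + c\right) < 0,\ \operatorname{Im}\left(\langle f, w \rangle_{\mathbb{H}} + \langle f^{*}, v \rangle_{\mathbb{H}} + c\right) < 0 \right\}.
\end{aligned}$$

Figure 4 demonstrates a simple case of a complex couple of hyperplanes that divides $\mathbb{C}$ into
four parts. Note that, in some cases, the complex couple of hyperplanes might degenerate
into two identical hyperplanes or two parallel hyperplanes.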
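As a small illustration of how a complex couple of hyperplanes partitions the space, the following Python sketch evaluates the widely linear form on $\mathbb{C}^d$ and reports the part in which a point lies. It assumes the finite dimensional case with the usual inner product; the function names and the string labels are illustrative choices.

```python
def widely_linear_value(f, w, v, c):
    # <f, w> + <f*, v> + c on C^d, with the inner product linear in the
    # first argument and conjugate linear in the second.
    t1 = sum(fk * wk.conjugate() for fk, wk in zip(f, w))
    t2 = sum(fk.conjugate() * vk.conjugate() for fk, vk in zip(f, v))
    return t1 + t2 + c


def region(f, w, v, c):
    # Returns "++", "+-", "-+" or "--" according to the signs of the real
    # and imaginary parts, or "boundary" if f lies on one of the hyperplanes.
    val = widely_linear_value(f, w, v, c)
    if val.real == 0 or val.imag == 0:
        return "boundary"
    return ("+" if val.real > 0 else "-") + ("+" if val.imag > 0 else "-")
```

With `v` set to zero the couple reduces to the complex linear (orthogonal) case; for example, with `w = [1]` and `c = 0` the point `1 + 0.4j` lies in the part `"++"`, whereas choosing `v = [0.5j]` tilts the two hyperplanes and moves the same point to `"+-"`.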

Figure 3: A hyperplane separates the space $\mathcal{H}$ into two parts, $\mathcal{H}_{+}$ and $\mathcal{H}_{-}$.

Figure 4: A complex couple of hyperplanes separates the space of complex numbers (i.e.,
$\mathbb{H} = \mathbb{C}$) into four parts.

The complex SVM classification task can be formulated as follows. Suppose we are given
training data, which belong to four separate classes $C_{++}, C_{+-}, C_{-+}, C_{--}$, i.e., $\{(z_n, d_n);\ n = 1, \dots, N\} \subset \mathcal{X} \times \{\pm 1 \pm i\}$.
If $d_n = +1 + i$, then the $n$-th sample belongs to $C_{++}$, i.e., $z_n \in C_{++}$; if $d_n = 1 - i$, then
$z_n \in C_{+-}$; if $d_n = -1 + i$, then $z_n \in C_{-+}$; and if $d_n = -1 - i$, then $z_n \in C_{--}$. Consider the
complex RKHS $\mathbb{H}$ with respective kernel $\kappa_{\mathbb{C}}$. Following a similar rationale to the real case,
we transform the input data from $\mathcal{X}$ to $\mathbb{H}$, via the feature map $\Phi_{\mathbb{C}}$. The goal of the SVM
task is to estimate a complex couple of maximum margin hyperplanes that separates the
points of the four classes as well as possible (see figure 5).
Thus, we need to minimize



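$$\left\| \begin{pmatrix} w^{r} + v^{r} \\ w^{i} - v^{i} \end{pmatrix} \right\|_{\mathcal{H}^2}^2 + \left\| \begin{pmatrix} -(w^{i} + v^{i}) \\ w^{r} - v^{r} \end{pmatrix} \right\|_{\mathcal{H}^2}^2 = \|w^{r} + v^{r}\|_{\mathcal{H}}^2 + \|w^{i} - v^{i}\|_{\mathcal{H}}^2 + \|w^{i} + v^{i}\|_{\mathcal{H}}^2 + \|w^{r} - v^{r}\|_{\mathcal{H}}^2$$
$$= 2\|w^{r}\|_{\mathcal{H}}^2 + 2\|w^{i}\|_{\mathcal{H}}^2 + 2\|v^{r}\|_{\mathcal{H}}^2 + 2\|v^{i}\|_{\mathcal{H}}^2 = 2\left(\|w\|_{\mathbb{H}}^2 + \|v\|_{\mathbb{H}}^2\right).$$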

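Therefore, the primal complex SVM optimization problem can be formulated as

$$\begin{aligned}
\min_{w, v, c} \quad & \frac{1}{2}\|w\|_{\mathbb{H}}^2 + \frac{1}{2}\|v\|_{\mathbb{H}}^2 + \frac{C}{N} \sum_{n=1}^{N} \left(\xi_n^{r} + \xi_n^{i}\right) \\
\text{s. to} \quad & d_n^{r} \operatorname{Re}\left(\langle \Phi_{\mathbb{C}}(z_n), w \rangle_{\mathbb{H}} + \langle \Phi_{\mathbb{C}}^{*}(z_n), v \rangle_{\mathbb{H}} + c\right) \geq 1 - \xi_n^{r}, \\
& d_n^{i} \operatorname{Im}\left(\langle \Phi_{\mathbb{C}}(z_n), w \rangle_{\mathbb{H}} + \langle \Phi_{\mathbb{C}}^{*}(z_n), v \rangle_{\mathbb{H}} + c\right) \geq 1 - \xi_n^{i}, \\
& \xi_n^{r}, \xi_n^{i} \geq 0,
\end{aligned} \tag{28}$$

for $n = 1, \dots, N$.

The Lagrangian function becomes

$$\begin{aligned}
L(w, v, c, \xi^{r}, \xi^{i}, a, b, \eta, \theta) = {} & \frac{1}{2}\|w\|_{\mathbb{H}}^2 + \frac{1}{2}\|v\|_{\mathbb{H}}^2 + \frac{C}{N} \sum_{n=1}^{N} \left(\xi_n^{r} + \xi_n^{i}\right) \\
& - \sum_{n=1}^{N} a_n \left( d_n^{r} \operatorname{Re}\left(\langle \Phi_{\mathbb{C}}(z_n), w \rangle_{\mathbb{H}} + \langle \Phi_{\mathbb{C}}^{*}(z_n), v \rangle_{\mathbb{H}} + c\right) - 1 + \xi_n^{r} \right) \\
& - \sum_{n=1}^{N} b_n \left( d_n^{i} \operatorname{Im}\left(\langle \Phi_{\mathbb{C}}(z_n), w \rangle_{\mathbb{H}} + \langle \Phi_{\mathbb{C}}^{*}(z_n), v \rangle_{\mathbb{H}} + c\right) - 1 + \xi_n^{i} \right) \\
& - \sum_{n=1}^{N} \eta_n \xi_n^{r} - \sum_{n=1}^{N} \theta_n \xi_n^{i},
\end{aligned}$$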

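where $a_n, b_n, \eta_n, \theta_n$ are the positive Lagrange multipliers of the respective inequalities, for
$n = 1, \dots, N$. To exploit the saddle point conditions of the Lagrangian function, we employ
the rules of Wirtinger's calculus to compute the respective gradients. Hence, we take

$$\frac{\partial L}{\partial w^{*}} = \frac{1}{2} w - \frac{1}{2} \sum_{n=1}^{N} a_n d_n^{r} \Phi_{\mathbb{C}}(z_n) + \frac{i}{2} \sum_{n=1}^{N} b_n d_n^{i} \Phi_{\mathbb{C}}(z_n),$$

$$\frac{\partial L}{\partial v^{*}} = \frac{1}{2} v - \frac{1}{2} \sum_{n=1}^{N} a_n d_n^{r} \Phi_{\mathbb{C}}^{*}(z_n) + \frac{i}{2} \sum_{n=1}^{N} b_n d_n^{i} \Phi_{\mathbb{C}}^{*}(z_n),$$

$$\frac{\partial L}{\partial c^{*}} = -\frac{1}{2} \sum_{n=1}^{N} a_n d_n^{r} + \frac{i}{2} \sum_{n=1}^{N} b_n d_n^{i},$$

$$\frac{\partial L}{\partial \xi_n^{r}} = \frac{C}{N} - a_n - \eta_n, \qquad \frac{\partial L}{\partial \xi_n^{i}} = \frac{C}{N} - b_n - \theta_n,$$

for $n = 1, \dots, N$. As all the gradients have to vanish, we finally take that

$$w = \sum_{n=1}^{N} \left(a_n d_n^{r} - i b_n d_n^{i}\right) \Phi_{\mathbb{C}}(z_n), \qquad v = \sum_{n=1}^{N} \left(a_n d_n^{r} - i b_n d_n^{i}\right) \Phi_{\mathbb{C}}^{*}(z_n), \qquad \sum_{n=1}^{N} a_n d_n^{r} = \sum_{n=1}^{N} b_n d_n^{i} = 0,$$

and $a_n + \eta_n = \frac{C}{N}$, $b_n + \theta_n = \frac{C}{N}$,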

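for $n = 1, \dots, N$. Following a similar procedure as in the complex SVR case, it turns out
that the dual problem can be split into two separate maximization tasks:

$$\begin{aligned}
\underset{a}{\text{maximize}} \quad & \sum_{n=1}^{N} a_n - \sum_{n,m=1}^{N} a_n a_m d_n^{r} d_m^{r}\, \kappa_{\mathbb{C}}^{r}(z_m, z_n) \\
\text{subject to} \quad & \sum_{n=1}^{N} a_n d_n^{r} = 0, \quad 0 \leq a_n \leq \frac{C}{N} \ \text{ for } n = 1, \dots, N,
\end{aligned} \tag{29a}$$

and

$$\begin{aligned}
\underset{b}{\text{maximize}} \quad & \sum_{n=1}^{N} b_n - \sum_{n,m=1}^{N} b_n b_m d_n^{i} d_m^{i}\, \kappa_{\mathbb{C}}^{r}(z_m, z_n) \\
\text{subject to} \quad & \sum_{n=1}^{N} b_n d_n^{i} = 0, \quad 0 \leq b_n \leq \frac{C}{N} \ \text{ for } n = 1, \dots, N.
\end{aligned} \tag{29b}$$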



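Observe that, similar to the regression case, these problems are equivalent to two
distinct real SVM (dual) tasks employing the induced real kernel $2\kappa_{\mathbb{C}}^{r}$. One may split the
(output) data into their real and imaginary parts, as figure 6 demonstrates, solve two real
SVM tasks employing any one of the standard algorithms and, finally, combine the solutions
to obtain the complex labeling function:

$$g(z) = \operatorname{sign}\left(\langle \Phi_{\mathbb{C}}(z), w \rangle_{\mathbb{H}} + \langle \Phi_{\mathbb{C}}^{*}(z), v \rangle_{\mathbb{H}} + c\right) = \operatorname{sign}\left(2 \sum_{n=1}^{N} \left(a_n d_n^{r} + i b_n d_n^{i}\right) \kappa_{\mathbb{C}}^{r}(z_n, z) + c^{r} + i c^{i}\right),$$

where $\operatorname{sign}(z) = \operatorname{sign}(\operatorname{Re}(z)) + i \operatorname{sign}(\operatorname{Im}(z))$.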
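The splitting of the labels and the final combination through the complex sign function can be sketched in a few lines of Python. The function names are illustrative; `g_re` and `g_im` stand for the real decision values returned by the two real SVM tasks, whatever solver is used to obtain them.

```python
def sign(x):
    # Real sign function with values in {-1, 0, +1}.
    return (x > 0) - (x < 0)


def complex_sign(z):
    # sign(z) = sign(Re(z)) + i sign(Im(z)).
    return complex(sign(z.real), sign(z.imag))


def split_labels(d):
    # Split quaternary labels in {+-1 +- i} into the two binary label
    # sequences used by the real and the imaginary SVM task.
    return [int(dn.real) for dn in d], [int(dn.imag) for dn in d]


def combine_decisions(g_re, g_im):
    # Merge the decision values of the two real SVMs into one label.
    return complex_sign(complex(g_re, g_im))
```

For instance, `split_labels([1 + 1j, -1 + 1j])` yields the binary label lists `[1, -1]` and `[1, 1]`, and `combine_decisions(0.3, -2.0)` returns the label `1 - 1j`.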
Remark 7 Following the complexification procedure, as in Remark 3, we select a real kernel
$\kappa_{\mathbb{R}}$ and transform the input data from $\mathcal{X}$ to the complexified space $\mathbb{H}$, via the feature map
$\Phi_{\mathbb{C}}$. We can easily deduce that the dual of the complexified SVM task is equivalent to two
real SVM tasks employing the kernel $2\kappa_{\mathbb{R}}$.

Remark 8 It is evident that both the complex and the complexified SVM can be employed
for binary classification as well. The advantage in this case is that one is able to handle
complex input data in both scenarios. Moreover, the popular 1-versus-1 and 1-versus-all
strategies, which address multiclassification problems, can be directly applied to complex
inputs using either the complex or the complexified binary SVM.

6. Experiments 

In order to illustrate the advantages gained by exploiting complex data and
to demonstrate the performance of the proposed algorithmic scheme, we compare it with
standard real-valued techniques, as well as with the dual real channel approach, using various
regression and classification tasks. In the following, we will refer to the pure complex kernel
rationale and the complexification trick, presented in this paper, using the terms CSVR
(or CSVM) and complexified SVR (or complexified SVM) respectively. The dual real chan-
nel approach, outlined in section 4.1, will be denoted as DRC-SVR. Recall that the DRC
approach is equivalent to the complexified rationale, although the latter often provides
more compact formulas and simpler representations. The following experiments were imple-
mented in Matlab. The respective code can be found in bouboulis.mysch.gr/kernels.html.

6.1 Function Estimation 

In this section, we perform a simple regression test on the complex function $\operatorname{sinc}(z)$. An
orthogonal grid of $33 \times 9$ actual points of the sinc function, corrupted by a mixture of white
Gaussian noise together with some impulses, was adopted as the training data. Figures
7 and 8 show the real and imaginary parts of the reconstructed function using the CSVR
rationale. Note the excellent visual results obtained from the corrupted training data. Figures
9 and 10 compare the square errors (i.e., $|\hat{d}_n - \operatorname{sinc}(z_n)|^2$, where $\hat{d}_n$ is the value of the
estimated function at $z_n$) between the CSVR and the DRC-SVR. In this experiment, it
is evident that the DRC-SVR fails to capture the complex structure of the function. On
the other hand, the CSVR rationale provides an estimation function that exhibits
excellent characteristics. A closer look reveals that at the border of the training grid the
square error increases in some cases. This is expected, as the available information that
is exploited by the SVR algorithm is reduced in these areas, compared to the interior
points of the grid, making the algorithm more sensitive to outliers. Besides the significant
decrease in the square error, in these experiments we also observed a significant reduction
in the computing time needed for the CSVR, compared to the DRC-SVR. Both algorithms
were implemented in Matlab on a computer with a Core i5 650 microprocessor running at
3.2 GHz. The total computing times for the CSVR and the DRC-SVR tasks were around
130 and 550 seconds respectively.

In all the performed experiments, the SMO algorithm was employed using the complex
Gaussian kernel and the real Gaussian kernel for the CSVR and the DRC-SVR respectively
(see Platt (1998)). The parameters of the kernel for both the complex SVR and the DRC-
SVR tasks were tuned to provide the smallest mean square error. In particular, for the
CSVR, the parameter of the complex Gaussian kernel was set to $t = 0.25$, while for the
DRC-SVR the parameter was set to $t = 4$. In both cases the parameters of the SVR task
were set as $C = 1000$, $\epsilon = 0.1$.
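A data set of the kind described above can be generated along the following lines. This is only a sketch under stated assumptions: the unnormalized definition $\operatorname{sinc}(z) = \sin(z)/z$, the grid extent, the noise level and the impulse model are all illustrative choices, since the text does not specify them, and the function names are hypothetical.

```python
import cmath
import math
import random


def csinc(z):
    # Assumed (unnormalized) complex sinc: sin(z) / z, with sinc(0) = 1.
    return 1.0 + 0j if z == 0 else cmath.sin(z) / z


def make_grid(nx=33, ny=9, x_max=4.0, y_max=1.0):
    # Orthogonal nx-by-ny grid; the extent (x_max, y_max) is an assumption.
    xs = [-x_max + 2 * x_max * i / (nx - 1) for i in range(nx)]
    ys = [-y_max + 2 * y_max * j / (ny - 1) for j in range(ny)]
    return [complex(x, y) for x in xs for y in ys]


def noisy_targets(points, noise_std=0.1, impulse_prob=0.05,
                  impulse_amp=1.0, seed=0):
    # White Gaussian noise plus occasional impulses (illustrative model).
    rng = random.Random(seed)
    out = []
    for z in points:
        d = csinc(z) + complex(rng.gauss(0, noise_std), rng.gauss(0, noise_std))
        if rng.random() < impulse_prob:
            d += impulse_amp * complex(rng.choice([-1, 1]), rng.choice([-1, 1]))
        out.append(d)
    return out


def mse_db(estimates, points):
    # Mean square error against the clean function, expressed in dB.
    mse = sum(abs(d - csinc(z)) ** 2 for d, z in zip(estimates, points))
    return 10 * math.log10(mse / len(points))
```

The helper `mse_db` reports the error on the same dB scale that is used in the captions of figures 9 and 10; a constant error of magnitude 0.1 at every grid point corresponds to $-20$ dB.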



6.2 Channel Identification 

In this section, we consider a non-linear channel identification task (see Adali and Li (2010)).
This channel consists of the 5-tap linear component:

$$t(n) = \sum_{k=1}^{5} h(k) \cdot s(n - k + 1), \tag{30}$$

where $h(k) = 0.432 \left( 1 + \cos\frac{2\pi(k-3)}{5} - \left( 1 + \cos\frac{2\pi(k-3)}{10} \right) i \right)$, for $k = 1, \dots, 5$, and the
nonlinear component:

$$x(n) = t(n) + (0.15 - 0.1i)\, t^2(n)$$

(see Sebald and Bucklew (2000); Sanchez-Fernandez et al. (2011); Bouboulis et al. (2012b,a)).
At the receiver's end of the channel, the signal is corrupted by white
Gaussian noise and then observed as $y(n)$. The level of the noise was set to 15dB. The input
signal that was fed to the channel had the form

$$s(n) = \sqrt{1 - \rho^2}\, X(n) + i \rho\, Y(n), \tag{31}$$

where $X(n)$ and $Y(n)$ are Gaussian random variables. This input is circular for $\rho = \sqrt{2}/2$
and highly non-circular if $\rho$ approaches 0 or 1 (Adali and Li (2010)). The CSVR and the
DRC-SVR rationales were used to address the channel identification task, which aims to
discover the input-output relationship between $(s(n - L + 1), s(n - L + 2), \dots, s(n))$ and
$y(n)$ (the parameter $L$ was set to $L = 5$). In each experiment, a set of 150 pairs of samples
was used to perform the training. After training, a set of 600 pairs of samples was used
to test the estimation performance of both algorithms (i.e., to measure the mean square
error between the actual channel output, $x(n)$, and the estimated output, $\hat{x}(n)$). To find
the best possible values of the parameters $C$ and $t$ that minimize the mean square error
for both SVR tasks, an extensive cross-validation procedure has been employed (see tables
1, 2) in a total of 20 sets of data. Figure 11 shows the minimum mean square error, which
has been obtained for all values of the kernel parameter $t$, versus the SVR parameter $C$, for
both cases. It is evident that the CSVR approach significantly outperforms the DRC-SVR
rationale, both in terms of MSE and computational time (figure 12). Both figures 11 and
12 refer to the circular case. As the results for the non-circular case are similar, they are
omitted to save space.
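The data generation for this experiment can be sketched as follows. The code is a minimal illustration, not the authors' Matlab implementation: the filter taps are passed in as the argument `h` rather than hard-coded, the noise level is interpreted as a signal-to-noise ratio in dB (an assumption), $X(n)$ and $Y(n)$ are taken to have unit variance (also an assumption), and all function names are hypothetical.

```python
import math
import random


def input_signal(n, rho, rng, scale=1.0):
    # s(n) = scale * (sqrt(1 - rho^2) X(n) + i rho Y(n)), with X, Y assumed
    # to be independent zero-mean, unit-variance Gaussian variables.
    root = math.sqrt(1.0 - rho ** 2)
    return [scale * complex(root * rng.gauss(0, 1), rho * rng.gauss(0, 1))
            for _ in range(n)]


def channel_output(s, h, gamma=0.15 - 0.1j, snr_db=15.0, rng=None):
    # Linear filter t(n) = sum_k h(k) s(n - k + 1) (zero initial conditions),
    # followed by the memoryless nonlinearity x(n) = t(n) + gamma * t(n)^2.
    rng = rng or random.Random(0)
    x = []
    for n in range(len(s)):
        t = sum(h[k] * s[n - k] for k in range(len(h)) if n - k >= 0)
        x.append(t + gamma * t ** 2)
    # Additive white Gaussian noise; "15dB" is read here as an SNR value.
    power = sum(abs(val) ** 2 for val in x) / len(x)
    noise_std = math.sqrt(power / 10 ** (snr_db / 10) / 2)
    y = [val + complex(rng.gauss(0, noise_std), rng.gauss(0, noise_std))
         for val in x]
    return x, y


def pseudo_variance(s):
    # Sample estimate of E[s^2]; it is close to zero for a circular signal.
    return sum(val * val for val in s) / len(s)
```

Under the unit-variance assumption, $E[s^2(n)] = 1 - 2\rho^2$, so the sample pseudo-variance is near zero for $\rho = \sqrt{2}/2$ and near one in magnitude when $\rho$ is close to 0 or 1, which matches the circularity statement above.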

6.3 Channel Equalization 

In this section, we present a non-linear channel equalization task that consists of the linear
filter (30) and the memoryless nonlinearity $x(n) = t(n) + (0.1 - 0.15i) \cdot t^2(n)$. At the receiver
end of the channel, the signal is corrupted by white Gaussian noise and then observed as
$y(n)$. The level of the noise was set to 15dB. The input signal that was fed to the channel
had the form

$$s(n) = 0.30 \left( \sqrt{1 - \rho^2}\, X(n) + i \rho\, Y(n) \right), \tag{32}$$

where $X(n)$ and $Y(n)$ are Gaussian random variables.

Figure 5: A complex couple of hyperplanes that separates the four given classes. The
hyperplanes are chosen so as to maximize the margin between the classes.

    C        t
    1000     1/6^2
    2000     1/6^2
    5000     1/8^2
    10000    1/9^2
    20000    1/11^2
    50000    1/13^2

Table 1: The values of C and t that minimize the mean square error of the CSVR, for the
channel identification task.

    C        t
    1000     1/4^2
    2000     1/5^2
    5000     1/6^2
    10000    1/7^2
    20000    1/7^2
    50000    1/10^2

Table 2: The values of C and t that minimize the mean square error of the DRC-SVR, for
the channel identification task.

Figure 6: Pure Complex Support Vector Machines.

The aim of a channel equalization task is to construct an inverse filter, which acts on
the output $y(n)$ and reproduces the original input signal as closely as possible. To this end,
we apply the CSVR and DRC-SVR algorithms to a set of samples of the form

$$\left( (y(n + D), y(n + D - 1), \dots, y(n + D - L + 1)),\ s(n) \right),$$

where $L > 0$ is the filter length and $D$ the equalization time delay (in this experiment we
set $L = 5$ and $D = 2$).
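The construction of these training samples from the observed sequence can be sketched as follows. The function name is hypothetical, and indices for which the window would reach outside the recorded sequence are simply skipped (an assumption about boundary handling).

```python
def equalization_pairs(y, s, L=5, D=2):
    # Build pairs ((y(n+D), y(n+D-1), ..., y(n+D-L+1)), s(n)) for every n
    # whose window lies entirely inside the observed sequence y.
    pairs = []
    for n in range(len(s)):
        newest = n + D
        oldest = n + D - L + 1
        if oldest < 0 or newest >= len(y):
            continue
        window = tuple(y[newest - j] for j in range(L))
        pairs.append((window, s[n]))
    return pairs
```

With $L = 5$ and $D = 2$, the first usable index is $n = 2$, whose window is $(y(4), y(3), y(2), y(1), y(0))$.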

Similar to the channel identification case, in each experiment, a set of 150 pairs of
samples was used to perform the training. After training, a set of 600 pairs of samples was
used to test the performance of both algorithms (i.e., to measure the mean square error
between the actual input, $s(n)$, and the estimated input, $\hat{s}(n)$). To find the best possible
values of the parameters $C$ and $t$ that minimize the mean square error for both SVR tasks,
an extensive cross-validation procedure has been employed (see tables 3, 4) in a total of 100
sets of data. Figure 13 shows the minimum mean square error, which has been obtained for
all values of the kernel parameter $t$, versus the SVR parameter $C$, for both cases considering
a circular input. The CSVR appears to achieve a slightly lower MSE for all values of the
parameter $C$, at the cost of a slightly increased computational time. The results for the
non-circular case are similar.



6.4 One versus three multiclass Classification 

We conclude the experimental section with the classification case. We performed two ex-
periments using the popular MNIST database of handwritten digits (LeCun and Cortes). In
both cases, the respective parameters of the SVM tasks were tuned to obtain the lowest
error rate possible. The MNIST database contains 60000 handwritten digits (from 0 to 9)
for training and 10000 handwritten digits for testing. Each digit is encoded as an image
file with $28 \times 28$ pixels. The scenario that is typically used to quantify the performance
of an SVM-like learning machine is to apply a one-versus-all strategy to the training set
(using the raw pixel values as input data) and then measure the success using the testing
set (LeCun et al. (1998); DeCoste and Schölkopf (2002)).

Figure 7: The real part (Re(sinc(z))) of the estimated sinc function from the complex SVR.
The points shown in the figure are the real parts of the noisy training data used in the
simulation.

Figure 8: The imaginary part (Im(sinc(z))) of the estimated sinc function from the complex
SVR. The points shown in the figure are the imaginary parts of the noisy training data used
in the simulation.

Table 3: The values of C and t that minimize the mean square error of the CSVR, for the
channel equalization task.

Figure 9: The square error (the actual values of the function versus the estimated ones) for
the complex SVR of the sinc function. The mean square error of all the estimated values
was equal to -14.5dB.

Figure 10: The square error (the actual values of the function versus the estimated ones)
for the Dual Real Channel regression of the sinc function. The mean square error of all the
estimated values was equal to -9.5dB.

Figure 11: MSE versus the SVR parameter C for both the CSVR and the DRC-SVR ra-
tionales, for the channel identification task.

Figure 12: Time (in seconds) versus MSE (dB) for both the CSVR and the DRC-SVR
rationales, for the channel identification task.

Table 4: The values of C and t that minimize the mean square error of the DRC-SVR, for
the channel equalization task.

Figure 13: MSE versus the SVR parameter C for both the CSVR and the DRC-SVR ra-
tionales, for the channel equalization task.

Figure 14: Time (in seconds) versus MSE (dB) for both the CSVR and the DRC-SVR
rationales, for the channel equalization task.

In the first experiment, we compare the aforementioned standard one-versus-all scenario
with a classification task that exploits complex numbers. In the complex variant, we per-
form a Fourier transform on each training image and keep only the 100 most significant
coefficients. As these coefficients are complex numbers, we employ a one-versus-all classifi-
cation task using the binary complexified SVM rationale (see remark 8). In both scenarios
we use the first 6000 digits of the MNIST training set to train the learning machines and
test their performance using the 10000 digits of the testing set. In addition, we used the
Gaussian kernel with $t = 1/64$ and $t = 1/140^2$ respectively. The SVM parameter $C$ has
been set equal to 100. The error rate of the standard real-valued scenario is 3.79%, while
the error rate of the complexified (one-versus-all) SVM is 3.46%. In both learning tasks we
used the SMO algorithm to train the SVM. The total amount of time needed to perform
the training of each learning machine is almost the same for both cases (the complexified
task is slightly faster).
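The feature extraction step of the complex variant can be sketched in plain Python as below. Two points are assumptions, since the text does not spell them out: "most significant" is read as "largest average magnitude over the training images", and one common index set is selected so that every image yields a feature vector of the same layout. The naive transform is written for clarity only; for $28 \times 28$ images an FFT routine would be used in practice. All names are hypothetical.

```python
import cmath


def dft2(img):
    # Naive two-dimensional discrete Fourier transform of a small image
    # given as a list of rows (quartic cost; for illustration only).
    rows, cols = len(img), len(img[0])
    out = [[0j] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            acc = 0j
            for x in range(rows):
                for y in range(cols):
                    angle = -2j * cmath.pi * (u * x / rows + v * y / cols)
                    acc += img[x][y] * cmath.exp(angle)
            out[u][v] = acc
    return out


def top_k_indices(spectra, k):
    # Rank frequency positions by their total magnitude over all training
    # spectra and keep the k strongest ones (assumed selection rule).
    rows, cols = len(spectra[0]), len(spectra[0][0])
    score = {(u, v): sum(abs(sp[u][v]) for sp in spectra)
             for u in range(rows) for v in range(cols)}
    return sorted(score, key=score.get, reverse=True)[:k]


def complex_features(spectrum, indices):
    # Complex input vector handed to the complexified SVM.
    return [spectrum[u][v] for u, v in indices]
```

For a constant image the only nonzero coefficient is the one at position `(0, 0)`, so selecting a single coefficient returns exactly that position.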

In section 5, we discussed how the 4-classes problem comes naturally to the complex
SVM. Exploiting the notion of the complex couple of hyperplanes (see figure 4), we have
shown that the generalization of the SVM rationale to complex spaces directly assumes
quaternary classification. Using this approach, the 4-classes problem can be solved using
only 2 distinct SVM tasks instead of the 4 tasks needed by the 1-versus-all or the 1-versus-1
strategies. The second experiment compares the quaternary complex SVM approach to the
standard 1-versus-all scenario using the first four digits (0, 1, 2 and 3). In both cases we
used the first 6000 such digits of the MNIST training set to train the learning machines. We
tested their performance using the digits contained in the testing set. The error rate of the
1-versus-all SVM was 0.721%, while the error rate of the complexified SVM was 0.866%.
However, the 1-versus-all SVM task required about double the time for training, compared
to the complexified SVM. This is expected, as the latter solves half as many distinct SVM
tasks as the former. In both experiments we used the Gaussian kernel with $t = 1/49$ and
$t = 1/160^2$ respectively. The SVM parameter $C$ has been set equal to 100 in this case also.



7. Conclusions 

We presented a framework of support vector regression and quaternary classification for
complex data using pure complex kernels, or complexified real ones, exploiting the recently
developed Wirtinger's calculus for complex RKHSs and the notions of widely linear esti-
mation. We showed that this problem is equivalent to solving two separate real SVM tasks
employing an induced real kernel (figure 2). The induced kernel depends on the choice of
the complex kernel and it is not one of the usual kernels used in the literature. Although
the machinery presented here might seem similar to the dual channel approach, the two have
important differences. The most important one is due to the incorporation of the induced
kernel $\kappa_{\mathbb{C}}^{r}$, which allows us to exploit the complex structure of the space, which is lost in
the dual channel approach. As an example we studied the complex Gaussian kernel and
showed that the induced kernel is not the real Gaussian RBF. To the best of our
knowledge this kernel has not appeared before in the literature. Hence, treating complex
tasks directly in the complex plane opens the way to employing novel kernels.

Furthermore, for the classification problem we have shown that the complex SVM
directly solves a quaternary problem, instead of the binary problem that is associated with the
real SVM. Hence, the complex SVM not only provides the means for treating complex
inputs, but also offers an alternative strategy to address multiclassification problems. In
this way, such problems can be solved faster (requiring about half the time), at the cost of an
increased error rate (in our experiment the increase was about 19%). Although in the
present work we focused on the 4-classes problem only, it is evident that the same rationale
can be carried over to any multiclass problem, where the classes must be divided into
four groups each time, following a rationale similar to the one-versus-all mechanism. This
will be addressed at a future time.